Project Goal: Threefold: (1) Transformation and cleaning of data. (2) Exploratory Data Analysis of Dating Site Data. (3) Machine Learning modelling.

Introduction

The objective of this study is to investigate the critical factors that contribute to an individual’s appeal, popularity, and recognition within an online dating platform. The data utilised for this research is sourced from Lovoo, a prominent European dating application, and is accessible via Kaggle.

The underlying motivation for this study stems from the desire to comprehend behavioural patterns that transcend the confines of physical attractiveness. The aim is to unveil hidden determinants that may shape interpersonal interactions within a digital dating platform. The behaviour exhibited on these platforms carries significance, even in economic contexts. By deciphering this behavioural paradigm, it can potentially contribute to the development of economic models. These enhanced models can subsequently offer a more profound analytic framework to elucidate overall mate-selection behaviour.

The initial phase of the analysis involves the transformation of raw data into a more interpretable format. This includes the creation of additional variables tailored to augment the predictive capacity of the statistical models employed in subsequent stages. This phase facilitates the exploratory aspect of the research, enabling an in-depth examination of data in search of potential predictor variables. The objective extends beyond understanding the phenomena; the aim is to anticipate which factors instigate an increased number of profile views and, subsequently, the ‘likes’ received.

The modelling process is a two-step approach. The first stage focuses on identifying variables that may elucidate why individuals view a certain profile. Potential variables include online presence, age, geographical location, and the timing of an individual’s online activity. The second stage aims to identify factors that influence the likelihood of a profile receiving ‘likes’. These may include the number of pictures on a profile, the characteristics of a profile’s biography, languages spoken, profile verification status, and mobile usage.

The project employed decision tree models to analyze the intricate patterns influencing user behaviour. Decision trees offer a clear and comprehensible framework to identify the complex characteristics that impact outcomes. Using these models, the analysis yielded insightful findings on unique aspects of user engagement on the dating platform. Two decision tree models were developed, each focusing on predicting different facets of user behaviour, namely profile views and ‘likes’. This approach facilitated a deeper and more nuanced comprehension of the drivers behind these two critical indicators of user engagement.

The results indicated a strong association between variables such as online presence, age, mobility, and timing of online activity and the number of profile views. However, profile views turned out to be the only predictor that significantly influences the 'likes' a profile receives. The decision tree models thus effectively shed light on the variables that play pivotal roles on digital dating platforms, and by extension in mate-searching behaviour.

Part 1: Transformation and Cleaning

The dataset in consideration comprises 3 973 observations and approximately 30 variables, each encapsulating specific attributes pertaining to individual profiles and related demographic information. An excerpt of the dataset is provided in Table 1, supplemented by Table 2, which provides further descriptions of a selection of significant variables. It is worth noting that the dataset solely encompasses data of individuals identifying as female. As such, the core objective of this analysis is to discern the determinants influencing the behavioural patterns of individuals displaying interest in females.

Table 1: Head of dataframe.

age counts_details counts_pictures counts_profileVisits counts_kisses counts_g flirtInterests_chat verified lang_count lang_de whazzup freetext
25 1.00 4 8279 239 3 TRUE 0 1 TRUE Nur tote fische schwimmen mit dem strom Nur tote Fisch schwimmen mit dem Strom
22 0.85 5 663 13 0 TRUE 0 3 TRUE Primaveraaa<3 NA
21 0.00 4 1369 88 2 FALSE 0 0 FALSE NA NA
20 0.12 3 22187 1015 3 TRUE 0 2 FALSE Je pense donc je suis. Instagram quedev NA
21 0.15 12 35262 1413 12 TRUE 0 1 TRUE Instagram: JESSSIESCH NA

Table 2: Description of variables in data set.

Variable Description
age Age of the individual.
counts_details How complete the profile is. Proportion of detail in the account. Measured from 0.0-1.0.
counts_pictures Number of pictures the profile contains.
counts_profileVisits How many times the profile has been viewed.
counts_kisses Number of ‘kisses’ or ‘likes’ received by profile.
counts_g Number of group interactions, which could represent the number of times a user has been added to or mentioned in a group chat.
flirtInterests_* What the individual is interested in. ’*’ represents: ‘chat’, ‘date’, ‘friends’.
verified Whether the profile has been verified or not.
lang_count Number of languages spoken by an individual.
lang_* Language spoken by an individual. ’*’ represents: ‘en’ (English), ‘de’ (German), ‘fr’ (French), ‘it’ (Italian), ‘es’ (Spanish).
whazzup/freetext A set of phrases that represent the profile’s ‘bio’.
isMobile Whether an individual can arrange transport for themselves.

The original dataset is already quite usable, but adding some new variables allows us to produce better models. The first step is a closer look at the language people use in their profiles, focusing on two main things: the words used in profile descriptions, and the use of emojis. Both could give insights into a person's confidence and desirability.

I created two new dummy variables, has_emoji and contains_popular_word. has_emoji is set to 1 when whazzup or freetext contains an emoji; contains_popular_word is set to 1 when whazzup or freetext contains a popular word. Figure 1 below shows which words are the most popular by means of a word cloud. (The word cloud is a dynamic image that shows a word's frequency when hovering over it.)
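The original pipeline appears to be in R; as an illustrative sketch in Python/pandas (with a hypothetical emoji pattern and a made-up popular-word list standing in for the word-cloud frequencies), the two dummies could be derived like this:

```python
import re
import pandas as pd

# Toy frame standing in for the Lovoo data; only the two bio columns matter here.
df = pd.DataFrame({
    "whazzup":  ["Nur tote Fische schwimmen mit dem Strom",
                 "Primaveraaa<3", None, "Instagram: JESSSIESCH"],
    "freetext": [None, None, None, None],
})

# Rough emoji/emoticon pattern: the main Unicode emoji blocks plus a "<3" heart.
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]|<3")

# Illustrative "popular word" set; the real list came from the word-cloud frequencies.
POPULAR = {"instagram", "snapchat", "insta", "liebe"}

# Combine both bio fields into one searchable string per profile.
bio = (df["whazzup"].fillna("") + " " + df["freetext"].fillna("")).str.strip()
df["has_emoji"] = bio.str.contains(EMOJI_RE).astype(int)
df["contains_popular_word"] = bio.str.lower().str.split().apply(
    lambda words: int(any(w.strip(":,.") in POPULAR for w in words))
)
```

The actual word list used for contains_popular_word was taken from the most frequent terms in Figure 1, so the set above should be read as a placeholder.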

Part 2: Exploratory Data Analysis

This segment aims to identify underlying patterns and relationships within the dataset. An initial step involves visually inspecting the variables, helping to assess their potential relevance and impact on the outcomes of interest. As a fundamental part of exploratory data analysis, these visual inspections allow one to discern which features could be instrumental in shaping predictive models.

In order to illuminate the relationships between the variables, a correlogram has been produced, which reveals some notable insights. For instance, there is a strong correlation between counts_kisses and counts_profileVisits (as expected), as well as a strong correlation between counts_g, counts_kisses and counts_profileVisits. A slight negative correlation is observed between profile views and factors such as age, interests leaning towards ‘just friends’, and shareability of the profile. On the contrary, having a verified status and showcasing multilingual abilities are positively correlated with profile likes, signifying their potential influence in enhancing a profile’s appeal. Interestingly, one of the paid features of the app, isHighlighted (which highlights your profile), shows no significant correlation with profile likes or views.
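The correlogram is built from the pairwise correlation matrix of the numeric profile metrics. A minimal sketch, using the values from the Table 1 excerpt and pandas rather than the original R workflow:

```python
import pandas as pd

# Numeric columns taken from the Table 1 excerpt.
df = pd.DataFrame({
    "counts_profileVisits": [8279, 663, 1369, 22187, 35262],
    "counts_kisses":        [239, 13, 88, 1015, 1413],
    "age":                  [25, 22, 21, 20, 21],
})

# Pairwise Pearson correlations: the matrix a correlogram visualises.
corr = df.corr()
print(corr.round(2))
```

Even on this five-row excerpt, the strong visits/kisses correlation noted above is visible.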

The newly created variables also show some positive relationships with profile likes and views. Specifically, bios containing emojis and social media tags have a correlation coefficient of 0.14 with profile likes. This points to their possible usefulness in predicting popularity.

Figure 2: Correlogram of profile characteristics and number of likes received.

As hinted in the introductory section, it quickly becomes apparent that specific variables have a more pronounced influence on the number of profile ‘Likes’, while others may largely dictate the number of ‘Profile Views’. This distinction is crucial, as certain profile elements only become observable once a profile is viewed. For instance, the information in a profile biography only comes into play during a profile view. Therefore, the dynamics of what draws views and subsequently encourages likes may differ significantly, although both are important aspects of profile engagement.

Interestingly, despite these differences, one notices a robust correlation between profile views and likes. This interplay implies that a successful profile is not just about attracting views but also about converting those views into likes. Figure 3 visually represents this relationship, further illuminating the interdependent nature of profile views and likes. Uncovering these patterns provides essential insights that can inform our subsequent modeling efforts.

Figure 3: Bubble plot of profile views and number of pictures in profile. A non-linear model (loess method) was fitted on the plot to discern possible patterns and differences between bios with emojis and those without. Size of dots represents number of group interactions.

Biography characteristics and popularity

Figure 4 below aims to show whether the distribution of likes received differs based on the newly created dummy variables, has_emoji, contains_popular_word, and night_owl. There seem to be some slight differences in likes received, supporting the idea that the use of emojis and certain words suggests higher levels of trust. Interestingly, being online seems to be negatively associated with profile views, as shown in the right panel of Figure 4. Usually being online at night, however, may increase profile views; I view this variable more as a control than a causal one, as more people tend to be online at night than during the day.

Figure 4: Violin plots showing effects of profile characteristics on popularity. Left panel shows effect of a bio containing social media particulars and/or an emoji on likes received. Right panel shows effect of an online profile and/or being a night owl on number of profile visits.

In terms of the structure of an individual’s bio, I have utilized bio length as a proxy for word complexity, with the hypothesis that longer bios may reflect a higher degree of linguistic complexity. The underlying assumption is that users who write longer bios may use a wider range of vocabulary and complex sentence structures, reflecting their capacity to express intricate thoughts or feelings. However, it’s important to note that length does not necessarily equate to complexity — short bios can also be highly nuanced and complex while long bios might be repetitive or simplistic. Hence, although bio length provides a starting point for analysis, more sophisticated measures of textual complexity could be desirable for a more comprehensive understanding.
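A minimal sketch of this proxy (Python for illustration, with bios taken from the Table 1 excerpt): character counts of the bio text, standardised to zero mean and unit variance so profiles are comparable.

```python
import pandas as pd

# Bio strings from the Table 1 excerpt; None marks an empty bio.
bios = pd.Series([
    "Nur tote Fische schwimmen mit dem Strom",
    "Primaveraaa<3",
    "Je pense donc je suis. Instagram quedev",
    None,
])

bio_length = bios.fillna("").str.len()
# Standardise so the complexity proxy is on a common scale.
bio_complexity = (bio_length - bio_length.mean()) / bio_length.std()
```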

The scatterplot in the right panel of figure 5 suggests there is no clear linear relationship between the standardized complexity of bios and the count of kisses received, indicating that a more complex bio does not necessarily attract more interactions. Although the presence of emojis differentiates two clusters, it doesn’t appear to be a strong influence on the count of likes either. This analysis challenges initial assumptions about what factors might drive interaction. However, it’s possible that other variables not considered here, such as user activity level or profile picture, could be more impactful. While the plot offers initial insights, it also points towards the need for a more comprehensive exploration of factors influencing user interaction.

Figure 5: Visualisation of bio complexity. Left panel shows distribution of length of bio. Right panel shows a scatter plot between bio length and number of likes received; a linear model was fitted to scrutinise any possible relationship between the variables.

Geographical characteristics and popularity

Utilising the Google Maps API, I successfully geocoded the locations of all profiles present in the dataset. The primary objective behind this was to explore and visualise the potential impact of geographical location on profile views. The role of location might be significant, considering how geographical and cultural aspects can influence user interactions and preferences on the platform. The code snippet below shows the process to perform the geocoding operation.
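The original geocoding was done against the Google Maps Geocoding API; the sketch below (Python standard library only, with a hypothetical helper for unpacking the response) shows the general shape of the operation. A real run requires a valid API key and network access.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode(place, api_key):
    """Query the Google Maps Geocoding API for a place name (needs a valid key)."""
    query = urllib.parse.urlencode({"address": place, "key": api_key})
    with urllib.request.urlopen(f"{GEOCODE_URL}?{query}", timeout=10) as resp:
        return json.load(resp)

def extract_lat_lng(payload):
    """Pull (lat, lng) from a Geocoding API response; None when nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    loc = payload["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]
```

In practice the loop over all profile locations should also cache results and respect the API's rate limits.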

The subsequent bubble plot illustrates a few disparities among cities. However, these contrasts are not significant enough to confirm any clear geographical trends in profile views. Thus, it is not feasible to definitively say that some regions show more inclination towards profile views than others based on this representation.

To gain a more insightful understanding, a choropleth map is utilized. This geographical representation not only gives a visual interpretation of data but also enhances comprehension through color-coding. Upon implementing this, it becomes noticeably clear that certain countries indeed experience higher profile views on average.

In particular, profiles originating from Spain, Hungary, and the Netherlands tend to attract more attention compared to other European countries. The reasons behind these trends are many: cultural nuances, user behaviours, or the presence of more active users in these regions. Future investigation might delve deeper into these aspects to provide more concrete explanations for the observed patterns.

Figure 6: Geographic data visualisation of profile views. Size and colour of bubbles in left panel indicate profile views. Colour of country in the right panel indicates profile views.

When the data is visualized on a map, one notices that profiles from certain countries tend to get more views. But there’s more to the story than geography.

I produced a lollipop chart (figure 7 below) to show the number of users in each region, with the colour of the lollipop indicating mean profile views. What we see is interesting: a country's overall user count did not necessarily match up with its mean profile views. This discrepancy can be chalked up to what we call 'sample size bias'. Simply put, countries with fewer users can show a higher mean number of views, because a few very popular individuals push up the average.

As it turns out, using a profile's country of origin to predict its popularity might be misleading. To make the final model as accurate as possible, it was decided to leave this variable out of the mix.

Figure 7: Lollipop chart of number of users by country. The colour of the lollipops indicates mean profile views.

Other profile characteristics and popularity

In this sub-section, the objective is to ascertain the impact of various profile attributes on the degree of popularity experienced on the dating application. The attributes under scrutiny include an array of factors, such as the number of pictures a profile has, its verification status, and whether it can be shared, among others.

Having a verified profile shows a slightly stronger positive relationship with the number of likes received than not having one. The figure below shows a scatter plot of profile likes and views, coloured by the verification status of the profile. Interestingly, if a profile is not shareable, this relationship inverts. However, this is possibly due to sample size bias, as there are far fewer profiles that are unshareable than profiles with share profile enabled.

Figure 8: Scatterplot of profile likes and views by verification and profile share status. In the left panel, only profiles with share profile enabled are shown, with the right panel only showing profiles with share profile disabled. A linear model was fitted to show the relationship more clearly.

Previous analyses revealed a mild negative correlation between age and both profile visits and likes, which is visualized in Figure 9. This figure comprises two scatterplots: the left panel depicts profile visits as a function of age, and the right panel similarly showcases profile likes as a function of age. Overlaying these scatterplots, a linear model has been fitted to more clearly illustrate the overall trend in the data. This downward trendline indicates that as the age of a user increases, the number of profile visits and likes tends to decrease slightly. While this trend is relatively mild, it suggests that younger users on the platform tend to garner more profile visits and likes compared to their older counterparts.

Figure 9: Scatterplot of profile likes and views by age. Left panel shows profile visits by age, where the right panel shows profile likes by age. A linear model was fitted to show the relationship more clearly.

The significance of language as a determinant of popularity is also explored in this analysis. This is reflected in Figure 10, where the number of languages spoken was considered as a potential predictor of popularity. Subsequently, Figure 10 provides a visualisation of the distribution of received likes in relation to specific languages spoken by the profiles.

Despite these considerations, the investigation does not reveal a discernible difference in the distribution of profile likes contingent on the languages spoken. The absence of any substantial differentiation in this context suggests that the language factor may not hold significant sway over profile popularity. Consequently, the language variable was not included in the formulation of the final predictive models. However, the number of languages spoken does show a mild positive correlation with profile likes.

Figure 10: Ridgeline plot of languages spoken and number of likes received. Dashed line shows overall mean profile likes.

The perceived attractiveness of a profile is often regarded as a significant determinant of mate searching behaviour. However, the dataset at hand does not include any direct measures of perceived attractiveness. Nevertheless, we have access to a proxy for this attribute, namely the number of pictures present in a profile. While it may not be the most accurate representation of attractiveness, it offers some insight into the visual appeal of a profile.

In conjunction with this, the presence of social media tags on a profile was also examined, given that these tags may serve as additional indicators of social validation or popularity.

Upon examining Figure 11, we observe a relationship between the number of pictures in a profile and the number of profile likes. Specifically, among profiles with few pictures, those without social media tags tend to receive fewer likes than comparable profiles that do feature such tags. As the number of pictures increases, the distinction between profiles with and without social media tags becomes less apparent.

This implies that while social media tags can enhance the visibility of a profile, their impact diminishes as the number of pictures increases. Thus, the number of pictures in a profile, serving as a rudimentary indicator of attractiveness, can also influence the popularity of a profile to some degree.

Figure 11: Dotted line plot of the number of pictures in profile and likes received. The mean number of likes received by number of photos was used to plot this relationship. Lines split based on social media tag presence in profile.

Part 3: Modeling User Engagement with Decision Trees

After data preparation and exploration, I proceeded with modelling. For this analysis, I elected to implement two popular machine learning techniques: decision trees and random forests. These methods were chosen for their interpretability, their effectiveness in handling complex datasets, and their capacity for both classification and regression tasks.

In each of these chosen techniques, two separate models were trained to serve distinct predictive purposes. The first model targets the prediction of profile views, while the second model aims at forecasting profile likes. This dual-model approach was adopted in recognition of the distinct factors that could potentially influence these two different measures of user engagement. Each model is trained on a different set of predictor variables, carefully chosen based on the insights gathered during the data exploration phase.

The first step involves partitioning the dataset into training and testing subsets. For this analysis, I adopted the widely used practice of a 70/30 split, whereby 70% of the data forms the training set and the remaining 30% is reserved for testing. This allocation ensures a balance: ample data to train the model effectively, whilst retaining a substantial portion for assessing the model's performance on unseen data. Moreover, I undertook this process twice, resulting in two distinct sets, one for profile views and another for profile likes, enabling a targeted examination of each aspect of profile engagement.
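The split itself is straightforward. A Python sketch of the 70/30 partition (the original analysis presumably used R, e.g. caret's createDataPartition):

```python
import pandas as pd

def train_test_split_70_30(df: pd.DataFrame, seed: int = 42):
    """Random 70/30 partition of a data frame into training and testing sets."""
    train = df.sample(frac=0.7, random_state=seed)  # 70% for training
    test = df.drop(train.index)                     # remaining 30% for testing
    return train, test

# Toy frame; in the analysis this is run once per target (views, likes).
df = pd.DataFrame({"x": range(100), "y": [i % 4 for i in range(100)]})
train, test = train_test_split_70_30(df)
```

Fixing the random seed keeps the views and likes experiments reproducible.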

This section of the report applies Decision Tree models to predict user engagement, specifically through profile visits and likes. Decision Trees are a type of predictive modeling that use a tree-like graph or model of decisions based on certain conditions. For this analysis, I’ve constructed two distinct models.

The first model, targeting profile visits, considers six variables: isOnline, night_owl, age, genderLooking, verified, and isMobile. These variables represent a range of user behaviors and characteristics that might impact the likelihood of a user’s profile being visited.

Our second model shifts focus to profile likes. It takes into account a broader set of variables, which includes: Profile_Views, counts_g, bio_length, has_emoji, has_social, counts_pictures, lang_count, flirtInterests_chat, flirtInterests_date, flirtInterests_friends, isMobile, verified, and shareProfileEnabled. These variables were selected based on their perceived relevance to a user’s profile attractiveness and likability.

Both models were trained using a custom tuning grid and 10-fold cross validation to optimize the complexity parameter (cp).

Profile visits model

I employed the Classification and Regression Trees (CART) model to investigate the relationship between the chosen six predictors and the target variable, which I divided into four categories: ‘Low’, ‘Low Mid’, ‘High Mid’, and ‘High’.
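The four categories correspond to discretising the continuous target; a quartile-based binning sketch (Python/pandas for illustration, toy visit counts):

```python
import pandas as pd

# Toy profile-visit counts standing in for counts_profileVisits.
visits = pd.Series([8279, 663, 1369, 22187, 35262, 120, 450, 9800])

# Quartile-based binning into the four engagement categories used by the models.
labels = ["Low", "Low Mid", "High Mid", "High"]
categories = pd.qcut(visits, q=4, labels=labels)
```

Quartile binning guarantees roughly balanced classes, which is consistent with the near-25% prevalences in the confusion matrices later on.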

To optimize the model’s performance, I adjusted the complexity parameter (cp), a key hyperparameter in decision tree models. This parameter controls the decision tree’s size and, by extension, its complexity. Lower cp values allow larger and more complex trees, which could potentially lead to overfitting. Conversely, higher cp values restrict the tree’s growth, leading to smaller and simpler models.
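As an illustrative analogue of the caret tuning (sketched with scikit-learn, whose cost-complexity parameter ccp_alpha plays a role similar to rpart's cp; the data here are synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the profile data (four engagement classes).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=4, random_state=1)

# Grid over the cost-complexity parameter, mirroring the cp grid of 0.001-0.1,
# evaluated with 10-fold cross-validation as in the original setup.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"ccp_alpha": [0.001, 0.005, 0.01, 0.05, 0.1]},
    cv=10,
    scoring="accuracy",
)
grid.fit(X, y)
best_cp = grid.best_params_["ccp_alpha"]
```

The cross-validated accuracy at each grid point is what the learning curve in Figure 12 plots.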

Figure 12: Profile visits model learning curve.

I explored a range of cp values from 0.001 to 0.1 and assessed the model’s accuracy at each point. The model exhibited its best performance - an accuracy of approximately 35.18% - with cp values within the 0.005 to 0.010 range. Similarly, the kappa statistic, which measures the concordance between the predicted and actual classifications, adjusting for chance agreement, mirrored this behavior, peaking at around 0.1357 for the same cp values.

The cp value of 0.01, which yielded the highest accuracy, was therefore chosen for the final model. However, it is important to note that an accuracy of 35.18%, though the best within this model configuration and dataset, may not suffice depending on the specific requirements of a given application. The low accuracy indicates that the model correctly predicts the outcome only about 35% of the time.

Figure 13: Profile visits model decision tree.

The decision tree model was trained on a dataset of 2779 observations and resulted in different decision rules that split the data to predict the category of ‘Profile_Views’.

The root node, encompassing all observations, was initially divided based on the ‘verified’ status of the user. The majority of the users in this node fell into the ‘Low’ category for profile views.

For users who are not verified (i.e., ‘verified’ < 0.5), the data was further split based on whether they are online.

For users who are online, the model made further distinctions based on the user’s age. Users aged 19.5 years and above were predominantly in the ‘Low’ profile views category. However, among the users under the age of 19.5, those who are mobile (‘isMobile’ >= 0.5) were mainly in the ‘High’ profile views category, while those who aren’t mostly fell into the ‘Low’ profile views category.

For users who are not online, the decision was again dependent on age. Users aged 23.5 years and above mostly had ‘Low’ profile views, while users below that age were split based on their mobility. Users who are non-mobile fell into the ‘Low Mid’ profile views category, while the ones who are mobile were largely in the ‘High’ category.

For users who are verified (i.e., ‘verified’ >= 0.5), the model predicted a majority of them to be in the ‘High’ profile views category.

In summary, the decision tree model suggests that factors such as verification status, online status, age, and mobility play a significant role in the number of profile views a user receives. This analysis could guide platform development and user engagement strategies.

## Confusion Matrix and Statistics
## 
##           Reference
## Predicted  Low Low Mid High Mid High
##   Low      209     154      132  106
##   Low Mid   19      28       16    7
##   High Mid   0       0        0    0
##   High      71     116      150  185
## 
## Overall Statistics
##                                                
##                Accuracy : 0.3537               
##                  95% CI : (0.3266, 0.3816)     
##     No Information Rate : 0.2506               
##     P-Value [Acc > NIR] : 0.00000000000000159  
##                                                
##                   Kappa : 0.1381               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.6990        0.09396          0.0000      0.6208
## Specificity              0.5615        0.95307          1.0000      0.6235
## Pos Pred Value           0.3478        0.40000             NaN      0.3544
## Neg Pred Value           0.8480        0.75957          0.7502      0.8316
## Prevalence               0.2506        0.24979          0.2498      0.2498
## Detection Rate           0.1752        0.02347          0.0000      0.1551
## Detection Prevalence     0.5038        0.05868          0.0000      0.4376
## Balanced Accuracy        0.6303        0.52352          0.5000      0.6221

The model’s overall accuracy is 0.3537, meaning it correctly predicted the ‘Profile_Views’ category approximately 35.37% of the time. This accuracy is significantly better than the No Information Rate (NIR), which represents the accuracy that could be achieved by always predicting the most frequent class - in this case ‘Low’. The NIR is 0.2506, and the P-value is practically 0, suggesting that the improvement of the model over the NIR is statistically significant.

The Kappa statistic of 0.1381 suggests that the agreement between the model’s predictions and the actual values is poor, as a Kappa of 1 indicates perfect agreement, while a Kappa near 0 indicates agreement equivalent to random chance.
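The reported Kappa can be verified directly from the confusion matrix above: observed agreement minus chance agreement, divided by one minus chance agreement.

```python
import numpy as np

# Confusion matrix of the profile-visits model (rows = predicted, cols = reference).
cm = np.array([
    [209, 154, 132, 106],   # Low
    [ 19,  28,  16,   7],   # Low Mid
    [  0,   0,   0,   0],   # High Mid
    [ 71, 116, 150, 185],   # High
])

n = cm.sum()
p_observed = np.trace(cm) / n                                 # overall accuracy
p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2   # chance agreement
kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(p_observed, 4), round(kappa, 4))  # prints: 0.3537 0.1381
```

This reproduces both the 0.3537 accuracy and the 0.1381 Kappa from the output above.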

Looking at the class-specific statistics:

For the ‘Low’ class, the model has a sensitivity (true positive rate) of 0.6990, meaning it correctly identified approximately 69.9% of the actual ‘Low’ instances. Its specificity (true negative rate), which measures how well the model identified ‘non-Low’ instances, was 0.5615.

For the ‘Low Mid’ class, both sensitivity and specificity were low (0.09396 and 0.95307 respectively), indicating that the model struggled to correctly identify ‘Low Mid’ instances but was good at identifying ‘non-Low Mid’ instances.

For the ‘High Mid’ class, the model didn’t make any predictions, which suggests it struggled with this category.

For the ‘High’ class, sensitivity was 0.6208 and specificity was 0.6235, indicating a moderate performance in identifying both ‘High’ and ‘non-High’ instances.

In summary, while the model did perform significantly better than a naive model (as evidenced by the p-value for Accuracy > NIR), there is considerable room for improvement, especially in accurately predicting the ‘Low Mid’ and ‘High Mid’ categories.

Profile likes model

The CART model to predict profile likes was calibrated across a range of cp values, spanning from 0.001 to 0.1, to determine the optimal performance point. The model showcased its top performance—an accuracy of approximately 73.4%—with a cp value of 0.002. The kappa statistic reflected this trend, reaching its peak of 0.645 at a cp value of 0.002.

The decision to select 0.1 as the cp value for the final model, despite the peak accuracy and kappa values attained at 0.002, was made to provide a balance between model complexity and prediction accuracy, helping to prevent overfitting.

Figure 14: Profile likes model learning curve.

The decision tree output depicts a tree-based model’s decision rules used to classify observations based on the variables ‘Profile_ViewsHigh’, ‘Profile_ViewsHigh Mid’, and ‘Profile_ViewsLow Mid’.

The tree starts with the root (node 1), which contains all 2779 observations. At this node, the model predicts ‘Low’ as it’s the most common class. The proportions of each class in this node are 25.51% for ‘Low’, 24.90% for ‘Low Mid’, 24.15% for ‘High Mid’, and 25.44% for ‘High’.

The first split occurs on ‘Profile_ViewsHigh’. If ‘Profile_ViewsHigh’ is less than 0.5, we move to node 2, which contains 2084 observations. The model predicts ‘Low’ at this node, and the class proportions have changed due to the split: 33.93% for ‘Low’, 33.11% for ‘Low Mid’, 27.02% for ‘High Mid’, and 5.95% for ‘High’. This indicates that instances with ‘Profile_ViewsHigh’ less than 0.5 are more likely to be ‘Low’ or ‘Low Mid’.

Node 2 further splits into node 4 (if ‘Profile_ViewsHigh Mid’ is less than 0.5) and node 5 (if ‘Profile_ViewsHigh Mid’ is greater than or equal to 0.5). Node 4 then splits again on ‘Profile_ViewsLow Mid’. Terminal nodes, denoted by an asterisk (*), are reached when ‘Profile_ViewsLow Mid’ is less than 0.5 (node 8, predicting ‘Low’) or greater than or equal to 0.5 (node 9, predicting ‘Low Mid’).

Turning to node 3: if ‘Profile_ViewsHigh’ is greater than or equal to 0.5, we have 695 observations, where the model predicts ‘High’. This node does not split further, making it a terminal node (node 3*).

Figure 15: Profile likes model decision tree.

The confusion matrix reveals how the Classification and Regression Tree (CART) model’s predictions fare against the actual outcomes.

Looking at the confusion matrix, the model correctly predicted 261 ‘Low’, 175 ‘Low Mid’, 186 ‘High Mid’, and 241 ‘High’ instances. However, there were instances where the model incorrectly classified the categories, as evidenced by the off-diagonal numbers.

The overall accuracy of the model is approximately 72.34%, suggesting that the model correctly classified about 72% of the total instances. The 95% confidence interval for the accuracy metric ranges from 69.71% to 74.86%, indicating the range in which we can expect the true accuracy of the model to lie, 95 times out of 100 if the experiment were repeated.
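caret appears to compute this interval with an exact binomial test; a normal-approximation sketch from the diagonal of the confusion matrix below lands within a fraction of a percentage point of the reported bounds:

```python
import math

# Diagonal sum (261 + 175 + 186 + 241) and total of the likes-model matrix.
correct, n = 863, 1193
p = correct / n                       # about 0.7234, the reported accuracy

# Normal-approximation 95% confidence interval for a binomial proportion.
se = math.sqrt(p * (1 - p) / n)
lo, hi = p - 1.96 * se, p + 1.96 * se
```

The approximate bounds (about 0.698 to 0.749) agree closely with the exact interval (0.6971, 0.7486) reported in the output.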

## Confusion Matrix and Statistics
## 
##           Reference
## Predicted  Low Low Mid High Mid High
##   Low      261      48        0    0
##   Low Mid   41     175       57    0
##   High Mid   3      70      186   57
##   High       0       1       53  241
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7234               
##                  95% CI : (0.6971, 0.7486)     
##     No Information Rate : 0.2557               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6311               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.8557         0.5952          0.6284      0.8087
## Specificity              0.9459         0.8910          0.8551      0.9397
## Pos Pred Value           0.8447         0.6410          0.5886      0.8169
## Neg Pred Value           0.9502         0.8707          0.8746      0.9365
## Prevalence               0.2557         0.2464          0.2481      0.2498
## Detection Rate           0.2188         0.1467          0.1559      0.2020
## Detection Prevalence     0.2590         0.2288          0.2649      0.2473
## Balanced Accuracy        0.9008         0.7431          0.7417      0.8742
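The headline numbers can be verified directly from the printed matrix. A short Python sketch (the matrix is transcribed from the output above; the rest is standard arithmetic for accuracy and Cohen’s kappa):

```python
import numpy as np

# Confusion matrix from the output above (rows = predicted, columns = actual),
# classes in order: Low, Low Mid, High Mid, High.
cm = np.array([[261,  48,   0,   0],
               [ 41, 175,  57,   0],
               [  3,  70, 186,  57],
               [  0,   1,  53, 241]])

n = cm.sum()
accuracy = np.trace(cm) / n                                 # observed agreement
expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2   # chance agreement
kappa = (accuracy - expected) / (1 - expected)

print(round(accuracy, 4))   # 0.7234
print(round(kappa, 4))      # 0.6311
```

The diagonal (261 + 175 + 186 + 241 = 863 of 1193 instances) gives the accuracy; kappa rescales it against the agreement expected by chance from the row and column marginals.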

The No Information Rate (NIR) is the accuracy achievable by always predicting the most frequent class. The p-value for the test that the model’s accuracy exceeds the NIR is extremely small (< 0.00000000000000022), so the model is statistically significantly better than a model that always predicts the most frequent class.

The Kappa statistic of 0.6311 suggests a reasonable level of agreement between the model’s predictions and the actual values, considering the agreement that might happen just by chance.

Inspecting the ‘Statistics by Class’, we see sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for each class. These metrics provide a more granular understanding of the model’s performance.

For instance, for ‘Low’, the sensitivity (or recall) is 85.57%, indicating that the model correctly identified 85.57% of the actual ‘Low’ instances. Its specificity is 94.59%, denoting that 94.59% of the time, the model correctly identified cases that were not ‘Low’. The PPV for ‘Low’ is 84.47%, which means when the model predicts ‘Low’, it’s correct 84.47% of the time. NPV of 95.02% means that when the model predicts a case is not ‘Low’, it is correct about 95.02% of the time.

The prevalence shows the proportion of each class in the data. Detection rate is the rate at which the model correctly identified each class, while detection prevalence is the rate at which the model predicted each class.

The balanced accuracy is the average of sensitivity and specificity, giving a more balanced measure of the model’s performance when classes are imbalanced.
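These per-class definitions can be checked against the printed matrix. A sketch for the ‘Low’ class (the matrix is transcribed from the output above; the formulas are the standard one-vs-rest definitions):

```python
import numpy as np

cm = np.array([[261,  48,   0,   0],
               [ 41, 175,  57,   0],
               [  3,  70, 186,  57],
               [  0,   1,  53, 241]])   # rows = predicted, cols = actual

i = 0                                   # class 'Low'
tp = cm[i, i]
fp = cm[i].sum() - tp                   # predicted Low, actually another class
fn = cm[:, i].sum() - tp                # actual Low, predicted otherwise
tn = cm.sum() - tp - fp - fn

sensitivity = tp / (tp + fn)                         # 0.8557
specificity = tn / (tn + fp)                         # 0.9459
ppv = tp / (tp + fp)                                 # 0.8447
npv = tn / (tn + fn)                                 # 0.9502
prevalence = (tp + fn) / cm.sum()                    # 0.2557
detection_rate = tp / cm.sum()                       # 0.2188
detection_prevalence = (tp + fp) / cm.sum()          # 0.2590
balanced_accuracy = (sensitivity + specificity) / 2  # 0.9008
```

Each value matches the ‘Class: Low’ column of the output, which also confirms the rows-as-predictions orientation of the matrix.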

It’s important to note that these metrics reveal that the model’s performance varies across different classes. It seems to perform best when predicting ‘Low’ and ‘High’ classes, while struggling more with ‘Low Mid’ and ‘High Mid’ classes.

Discussion and Conclusion

The decision tree models built for predicting profile visits and likes provide important insights into the dynamics of user engagement on the platform. However, the disparity in performance and explanatory power between the two models invites critical scrutiny and indicates areas for improvement.

Discussion on the Profile Visits Model

In the profile visits model, several issues arise from the variables considered, the model’s performance, and the structure of the decision tree.

Firstly, the chosen variables—isOnline, night_owl, age, genderLooking, verified, and isMobile—do seem intuitively relevant to the question of who might visit a user’s profile. However, the accuracy of the model suggests that these variables, or at least how they are used within this decision tree framework, do not provide a comprehensive or highly accurate prediction of user profile visits.

It’s crucial to explore why certain variables—such as night_owl and genderLooking—were dropped from the decision tree. This may be a result of these variables not having enough predictive power, or perhaps the specific algorithm used (CART in this case) did not find a beneficial split on these variables. It could also be an artefact of the tuning of the complexity parameter, which can lead to simpler trees. It might be beneficial to include these variables in future models or explore alternative modelling strategies that can better leverage these variables.
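The complexity-parameter effect mentioned above can be illustrated with cost-complexity pruning. A minimal Python sketch (the data, the scikit-learn `ccp_alpha` analogue of rpart’s cp, and the parameter values are assumptions, not the study’s):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Illustrative only: how a complexity penalty prunes away weak splits.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6))                  # stand-ins for isOnline, age, ...
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)  # driven mostly by one feature

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

# A larger penalty leaves fewer leaves: splits on weakly predictive
# variables (features 1-5 here) tend to vanish from the pruned tree.
print(full.get_n_leaves(), pruned.get_n_leaves())
```

In the same way, a cp value tuned for generalisation can legitimately remove night_owl or genderLooking from the fitted tree even if they carry some signal.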

The model’s low accuracy (~35%) also warrants discussion. In many machine learning tasks, accuracy in this range might be considered very low. However, context matters: is this a difficult prediction task where 35% accuracy is actually a considerable achievement, or is it a failure of the model? Furthermore, the poor kappa statistic (~0.14) indicates that the model’s predictive performance is barely better than random chance. Given these statistics, there is likely substantial room for improvement in this model.

The decision tree itself also raises some interesting points. The heavy reliance on the verified variable, with the tree first splitting on this, may suggest that this variable is of high importance. However, it might also mask the potential contributions of other variables if verified is interacting with them in a way not captured by the decision tree. For example, if being verified only increases profile visits for online and mobile users, then a decision tree split like this would not adequately model that interaction.

Discussion on the Profile Likes Model

Comparatively, the profile likes model demonstrated a much higher accuracy (~73%), indicating that it is a better predictor of user likes than the profile visits model was of user visits. The model also maintained a much stronger kappa statistic (~0.63), implying a good agreement between predictions and actual values.

However, even this better-performing model is not without its limitations and criticisms. For one, a larger number of variables were considered in the profile likes model. It included interaction terms (Profile_Views x counts_g and bio_length x (has_emoji + has_social)) and other profile features such as counts_pictures, lang_count, flirtInterests_chat, flirtInterests_date, flirtInterests_friends, counts_details, isMobile, verified, and shareProfileEnabled.

Notably, all variables (save for Profile_Views) were dropped from the decision tree model, likely for reasons similar to those in the profile visits model. This again raises questions about the suitability of these variables for this kind of model and whether other modelling strategies might be more appropriate.

In the decision tree structure, the fact that every split is on a category of the Profile_Views variable (‘High’, ‘High Mid’, ‘Low Mid’) implies an excessive reliance on this variable by the model. This is most plausibly explained by the strong correlation between profile views and likes: a profile is typically viewed before it is liked, so views dominate the splits and leave little residual variation for the other predictors to explain.

Conclusion

The project has offered compelling insights into the factors that influence user behaviour on digital dating platforms. By employing decision tree models, it was possible to explore and predict the complex web of interactions shaping the number of profile views and ‘likes’ received. The results, however, underline the complexity of mate-searching behaviour: the chosen predictors were largely unsuccessful in determining profile views and likes. Even so, these findings could be harnessed not only to enrich user experience on these platforms, but also to contribute to the development of more nuanced economic models and theories around online behaviour and digital interaction. The potential for further research in this area is immense, with opportunities for exploring and understanding the dynamics of interpersonal interactions in the digital world.